Template-Independent Web Object Extraction

نویسندگان

  • Zaiqing Nie
  • Fei Wu
  • Ji-Rong Wen
  • Wei-Ying Ma
چکیده

There are various kinds of objects embedded in static Web pages and online Web databases. Extracting and integrating these objects from the Web is of great significance for Web data management. The existing Web information extraction (IE) techniques cannot provide satisfactory solution to the Web object extraction task since objects of the same type are distributed in diverse Web sources, whose structures are highly heterogeneous. The classic information extraction (IE) methods, which are designed for processing plain text documents, also fail to meet our requirements. In this paper, we propose a novel approach called Object-Level Information Extraction (OLIE) to extract Web objects. This approach extends a classic IE algorithm, Conditional Random Fields (CRF), by adding Web-specific information. It is essentially a combination of Web IE and classic IE. Specifically, visual information on the Web pages is used to select appropriate atomic elements for extraction and also to distinguish attributes, and structured information from external Web databases is applied to assist the extraction process. The experimental results show OLIE can significantly improve the Web object extraction accuracy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Site-Independent Template-Block Detection

Detection of template and noise blocks in web pages is an important step in improving the performance of information retrieval and content extraction. Of the many approaches proposed, most rely on the assumption of operating within the confines of a single website or require expensive hand-labeling of relevant and non-relevant blocks for model induction. This reduces their applicability, since ...

متن کامل

Inférer des Objets Sémantiques du Web Structuré

This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose, and other statistics. Next, in the context of dynamically-generated Web...

متن کامل

Web Template Extraction Based on Hyperlink Analysis

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugin content into already formatted and prepared pagelets. For the final user templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important ...

متن کامل

Effective and Enhanced method for Template Extraction from Heterogeneous Web Pages

To achieve high productivity publishing the web pages are automatically evaluated using common templates with contents. The templates provide readers easy access to the contents guided by consistent structures. Cluster the web documents based on the similarity of underlying template structures in the documents so that the template for each cluster is extracted simultaneously. This process propo...

متن کامل

Site-Level Web Template Extraction Based on DOM Analysis

One of the main development resources for website engineers are Web templates. Templates allow them to increase productivity by plugin content into already formatted and prepared pagelets. For the final user templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005